Fire up GraphLab Create


In [1]:
import graphlab

Load some house sales data


In [2]:
sales = graphlab.SFrame('home_data.gl')


[INFO] This non-commercial license of GraphLab Create is assigned to alireza.smr@gmail.com and will expire on September 24, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-93286 - Server binary: /Users/alireza/.graphlab/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1446642743.log
[INFO] GraphLab Server Version: 1.6.1

Exploring the data for housing sales


In [8]:
sales.show(view="Scatter Plot", x="sqft_living", y="price")


Canvas is accessible via web browser at the URL: http://localhost:59848/index.html
Opening Canvas in default web browser.

Create Simple Regression Model of sqft_living to price


In [8]:
sales


Out[8]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
7129300520 2014-10-13 00:00:00+00:00 221900 3 1 1180 5650 1 0
6414100192 2014-12-09 00:00:00+00:00 538000 3 2.25 2570 7242 2 0
5631500400 2015-02-25 00:00:00+00:00 180000 2 1 770 10000 1 0
2487200875 2014-12-09 00:00:00+00:00 604000 4 3 1960 5000 1 0
1954400510 2015-02-18 00:00:00+00:00 510000 3 2 1680 8080 1 0
7237550310 2014-05-12 00:00:00+00:00 1225000 4 4.5 5420 101930 1 0
1321400060 2014-06-27 00:00:00+00:00 257500 3 2.25 1715 6819 2 0
2008000270 2015-01-15 00:00:00+00:00 291850 3 1.5 1060 9711 1 0
2414600126 2015-04-15 00:00:00+00:00 229500 3 1 1780 7470 1 0
3793500160 2015-03-12 00:00:00+00:00 323000 3 2.5 1890 6560 2 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 3 7 1180 0 1955 0 98178 47.51123398
0 3 7 2170 400 1951 1991 98125 47.72102274
0 3 6 770 0 1933 0 98028 47.73792661
0 5 7 1050 910 1965 0 98136 47.52082
0 3 8 1680 0 1987 0 98074 47.61681228
0 3 11 3890 1530 2001 0 98053 47.65611835
0 3 7 1715 0 1995 0 98003 47.30972002
0 3 7 1060 0 1963 0 98198 47.40949984
0 3 7 1050 730 1960 0 98146 47.51229381
0 3 7 1890 0 2003 0 98038 47.36840673
long sqft_living15 sqft_lot15
-122.25677536 1340.0 5650.0
-122.3188624 1690.0 7639.0
-122.23319601 2720.0 8062.0
-122.39318505 1360.0 5000.0
-122.04490059 1800.0 7503.0
-122.00528655 4760.0 101930.0
-122.32704857 2238.0 6819.0
-122.31457273 1650.0 9711.0
-122.33659507 1780.0 8113.0
-122.0308176 2390.0 7570.0
[21613 rows x 21 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [9]:
train_data,test_data = sales.random_split(.8,seed=0)
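
The call above sends roughly 80% of the rows to train_data and the rest to test_data, and seed=0 makes the split repeatable. As a rough illustration of that idea (not GraphLab's actual implementation, which this notebook does not show), here is a minimal pure-Python sketch. It assumes each row is assigned independently with probability 0.8; the helper name random_split and the stand-in rows list are hypothetical. It runs on Python 2.7 and Python 3.

```python
import random


def random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction` (illustrative only)."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # hypothetical stand-in for the rows of `sales`
train_rows, test_rows = random_split(rows, 0.8, seed=0)
```

Under this assumption the split sizes are only approximately 80/20, and rerunning with the same seed gives the same two lists.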

Build the regression model


In [10]:
sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'])


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16503
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 1
PROGRESS: Number of coefficients    : 2
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 1.023398     | 4357231.530617     | 3146023.385001       | 261524.626678 | 288253.590973   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
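
The log reports 2 coefficients: an intercept and one slope for sqft_living. GraphLab fits them with Newton's method, as the log says. The sketch below instead uses the textbook closed-form least-squares formulas, which is a different route to the same kind of line, and it ignores any regularization or feature rescaling GraphLab may apply. The function name fit_simple_regression is hypothetical, and the data is only the first five rows of the table printed above, so the numbers will not match the full fit.

```python
def fit_simple_regression(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (unnormalized)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# sqft_living and price from the first five rows of `sales` shown above
sqft = [1180, 2570, 770, 1960, 1680]
price = [221900, 538000, 180000, 604000, 510000]
intercept, slope = fit_simple_regression(sqft, price)
```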

Evaluate Simple Model


In [12]:
print test_data['price'].mean()


543054.042563

In [14]:
print sqft_model.evaluate(test_data)


{'max_error': 4149587.1862886357, 'rmse': 255174.30834723086}
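
To make the two numbers concrete, here is a small sketch of how such metrics are usually defined: rmse as the square root of the mean squared error, and max_error as the largest absolute error. This assumes the standard definitions rather than quoting GraphLab's code. The function name evaluate and the three prices are made up for illustration.

```python
import math


def evaluate(actual, predicted):
    """Return max_error and rmse under their usual definitions."""
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / float(len(errors)))
    max_error = max(abs(e) for e in errors)
    return {'max_error': max_error, 'rmse': rmse}


# Hypothetical prices, not taken from the data set
actual = [200000.0, 350000.0, 500000.0]
predicted = [210000.0, 330000.0, 540000.0]
metrics = evaluate(actual, predicted)
```

Both metrics are in dollars, which is why it helps to compare the RMSE of about 255,000 against the mean test price of about 543,000 printed above.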

Visualizing the Prediction


In [15]:
import matplotlib.pyplot as plt
%matplotlib inline

In [16]:
plt.plot(test_data['sqft_living'],test_data['price'],'.',
        test_data['sqft_living'],sqft_model.predict(test_data),'-')


Out[16]:
[<matplotlib.lines.Line2D at 0x106007d50>,
 <matplotlib.lines.Line2D at 0x112701710>]

In [17]:
sqft_model.get('coefficients')


Out[17]:
name index value
(intercept) None -45488.7763763
sqft_living None 281.183173922
[2 rows x 3 columns]

Explore other features in the data


In [18]:
features = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','zipcode']

In [21]:
sales[features].show()


Canvas is accessible via web browser at the URL: http://localhost:59740/index.html
Opening Canvas in default web browser.

In [22]:
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')


Canvas is updated and available in a tab in the default browser.

Build Regression Model with More Features


In [23]:
my_features_model = graphlab.linear_regression.create(train_data, target='price', features=features)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16536
PROGRESS: Number of features          : 6
PROGRESS: Number of unpacked features : 6
PROGRESS: Number of coefficients    : 115
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.041302     | 3750846.574966     | 2491462.643862       | 181383.824914 | 195917.626553   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+

In [27]:
print features


['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [29]:
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)


{'max_error': 4149587.1862886357, 'rmse': 255174.30834723086}
{'max_error': 3460731.684507328, 'rmse': 179169.8862436952}

Apply learned models to prices of 3 houses


In [30]:
house1 = sales[sales['id']=='5309101200']

In [31]:
house1


Out[31]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
5309101200 2014-06-05 00:00:00+00:00 620000 4 2.25 2400 5350 1.5 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 4 7 1460 940 1929 0 98117 47.67632376
long sqft_living15 sqft_lot15
-122.37010126 1250.0 4880.0
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

In [32]:
print house1['price']


[620000, ... ]

In [34]:
print sqft_model.predict(house1)


[629350.8410362888]
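
This prediction can be reproduced by hand from the coefficients printed earlier by sqft_model.get('coefficients'): price = intercept + slope * sqft_living. The sketch below plugs in the displayed (rounded) coefficient values and house1's sqft_living of 2400. The function name predict_price is hypothetical.

```python
# Coefficients as displayed by sqft_model.get('coefficients')
intercept = -45488.7763763
slope = 281.183173922


def predict_price(sqft_living):
    """Evaluate the fitted line at a given living area."""
    return intercept + slope * sqft_living


house1_prediction = predict_price(2400)  # house1 has sqft_living = 2400
```

The result agrees with the model's output above to within the rounding of the displayed coefficients.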

In [35]:
print my_features_model.predict(house1)


[722466.6864358183]

In [36]:
house2 = sales[sales['id']=='1925069082']

In [37]:
print house2


+------------+---------------------------+---------+----------+-----------+-------------+
|     id     |            date           |  price  | bedrooms | bathrooms | sqft_living |
+------------+---------------------------+---------+----------+-----------+-------------+
| 1925069082 | 2015-05-11 00:00:00+00:00 | 2200000 |    5     |    4.25   |     4640    |
+------------+---------------------------+---------+----------+-----------+-------------+
+----------+--------+------------+------+-----------+-------+------------+---------------+
| sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement |
+----------+--------+------------+------+-----------+-------+------------+---------------+
|  22703   |   2    |     1      |  4   |     5     |   8   |    2860    |      1780     |
+----------+--------+------------+------+-----------+-------+------------+---------------+
+----------+--------------+---------+-------------+---------------+---------------+-----+
| yr_built | yr_renovated | zipcode |     lat     |      long     | sqft_living15 | ... |
+----------+--------------+---------+-------------+---------------+---------------+-----+
|   1952   |      0       |  98052  | 47.63925783 | -122.09722322 |     3140.0    | ... |
+----------+--------------+---------+-------------+---------------+---------------+-----+
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

In [38]:
print house2['price']


[2200000, ... ]

In [39]:
print sqft_model.predict(house2)


[1259201.150621358]

In [40]:
print my_features_model.predict(house2)


[1436501.026331065]

In [46]:
expensiveHouses = sales[sales['zipcode']=='98039']

In [47]:
print expensiveHouses


+------------+---------------------------+---------+----------+-----------+-------------+
|     id     |            date           |  price  | bedrooms | bathrooms | sqft_living |
+------------+---------------------------+---------+----------+-----------+-------------+
| 3625049014 | 2014-08-29 00:00:00+00:00 | 2950000 |    4     |    3.5    |     4860    |
| 2540700110 | 2015-02-12 00:00:00+00:00 | 1905000 |    4     |    3.5    |     4210    |
| 3262300940 | 2014-11-07 00:00:00+00:00 |  875000 |    3     |     1     |     1220    |
| 3262300940 | 2015-02-10 00:00:00+00:00 |  940000 |    3     |     1     |     1220    |
| 6447300265 | 2014-10-14 00:00:00+00:00 | 4000000 |    4     |    5.5    |     7080    |
| 2470100110 | 2014-08-04 00:00:00+00:00 | 5570000 |    5     |    5.75   |     9200    |
| 2210500019 | 2015-03-24 00:00:00+00:00 |  937500 |    3     |     1     |     1320    |
| 6447300345 | 2015-04-06 00:00:00+00:00 | 1160000 |    4     |     3     |     2680    |
| 6447300225 | 2014-11-06 00:00:00+00:00 | 1880000 |    3     |    2.75   |     2620    |
| 2525049148 | 2014-10-07 00:00:00+00:00 | 3418800 |    5     |     5     |     5450    |
+------------+---------------------------+---------+----------+-----------+-------------+
+----------+--------+------------+------+-----------+-------+------------+---------------+
| sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement |
+----------+--------+------------+------+-----------+-------+------------+---------------+
|  23885   |   2    |     0      |  0   |     3     |   12  |    4860    |       0       |
|  18564   |   2    |     0      |  0   |     3     |   11  |    4210    |       0       |
|   8119   |   1    |     0      |  0   |     4     |   7   |    1220    |       0       |
|   8119   |   1    |     0      |  0   |     4     |   7   |    1220    |       0       |
|  16573   |   2    |     0      |  0   |     3     |   12  |    5760    |      1320     |
|  35069   |   2    |     0      |  0   |     3     |   13  |    6200    |      3000     |
|   8500   |   1    |     0      |  0   |     4     |   7   |    1320    |       0       |
|  15438   |   2    |     0      |  2   |     3     |   8   |    2680    |       0       |
|  17919   |   1    |     0      |  1   |     4     |   9   |    2620    |       0       |
|  20412   |   2    |     0      |  0   |     3     |   11  |    5450    |       0       |
+----------+--------+------------+------+-----------+-------+------------+---------------+
+----------+--------------+---------+-------------+---------------+---------------+-----+
| yr_built | yr_renovated | zipcode |     lat     |      long     | sqft_living15 | ... |
+----------+--------------+---------+-------------+---------------+---------------+-----+
|   1996   |      0       |  98039  | 47.61717049 | -122.23040939 |     3580.0    | ... |
|   2001   |      0       |  98039  | 47.62060082 |  -122.2245047 |     3520.0    | ... |
|   1955   |      0       |  98039  | 47.63281908 | -122.23554392 |     1910.0    | ... |
|   1955   |      0       |  98039  | 47.63281908 | -122.23554392 |     1910.0    | ... |
|   2008   |      0       |  98039  | 47.61512031 | -122.22420058 |     3140.0    | ... |
|   2001   |      0       |  98039  | 47.62888314 | -122.23346379 |     3560.0    | ... |
|   1954   |      0       |  98039  | 47.61872888 | -122.22643371 |     2790.0    | ... |
|   1902   |     1956     |  98039  | 47.61089438 | -122.22582388 |     4480.0    | ... |
|   1949   |      0       |  98039  | 47.61435052 | -122.22772057 |     3400.0    | ... |
|   2014   |      0       |  98039  | 47.62087993 | -122.23726918 |     3160.0    | ... |
+----------+--------------+---------+-------------+---------------+---------------+-----+
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

In [48]:
print expensiveHouses['price'].mean()


2160606.6

In [63]:
fraction_finder = sales[(sales['sqft_living'] >= 2000) & (sales['sqft_living'] <= 4000)]
fraction_finder.show()


Canvas is updated and available in a tab in the default browser.
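
The name fraction_finder suggests the goal is the fraction of houses with between 2000 and 4000 sq ft of living space, although the cell above only opens the filtered rows in Canvas. A minimal sketch of the filter-then-divide step in plain Python follows. It uses only the ten sqft_living values from the head of sales printed earlier, so the resulting fraction describes those ten rows, not the full data set.

```python
# sqft_living from the ten rows of `sales` displayed earlier
sqft_living = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]

# Boolean mask, the list equivalent of (col >= 2000) & (col <= 4000)
mask = [2000 <= s <= 4000 for s in sqft_living]
selected = [s for s, keep in zip(sqft_living, mask) if keep]

# float() keeps the division from truncating under Python 2
fraction = len(selected) / float(len(sqft_living))
```

On the real SFrames the same ratio would presumably be len(fraction_finder) / float(len(sales)), since the output above notes that len(sf) works on an SFrame.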

Building Advanced Model


In [55]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]

In [56]:
print advanced_features


['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode', 'condition', 'grade', 'waterfront', 'view', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

In [58]:
my_advance_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16476
PROGRESS: Number of features          : 18
PROGRESS: Number of unpacked features : 18
PROGRESS: Number of coefficients    : 127
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.031650     | 3468187.219168     | 2197643.599194       | 154092.671887 | 165367.609769   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+

In [59]:
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)
print my_advance_model.evaluate(test_data)


{'max_error': 4149587.1862886357, 'rmse': 255174.30834723086}
{'max_error': 3460731.684507328, 'rmse': 179169.8862436952}
{'max_error': 3553748.8054397926, 'rmse': 156678.87771855685}
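
One way to read these three results is as a relative reduction in test RMSE against the single-feature model. The sketch below is plain arithmetic on the values printed above; the dictionary names rmse and reduction are only for illustration.

```python
# Test-set RMSE values copied from the output above
rmse = {
    'sqft_model': 255174.30834723086,
    'my_features_model': 179169.8862436952,
    'my_advance_model': 156678.87771855685,
}

baseline = rmse['sqft_model']
# Fraction by which each model lowers RMSE relative to the baseline
reduction = {name: (baseline - value) / baseline for name, value in rmse.items()}
```

This comes to a reduction of a little under 30% for my_features_model and a little under 39% for my_advance_model. Note that max_error does not follow the same ordering: it is higher for my_advance_model than for my_features_model.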
